Goto

Collaborating Authors

 reading behavior


Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes

de Langis, Karin, Walker, William, Le, Khanh Chi, Kang, Dongyeop

arXiv.org Artificial Intelligence

We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., what parts of the text they focus on, re-read or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset PreferRead that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.


Developing a Dyslexia Indicator Using Eye Tracking

Cogan, Kevin, Ngo, Vuong M., Roantree, Mark

arXiv.org Artificial Intelligence

Dyslexia, affecting an estimated 10% to 20% of the global population, significantly impairs learning capabilities, highlighting the need for innovative and accessible diagnostic methods. This paper investigates the effectiveness of eye-tracking technology combined with machine learning algorithms as a cost-effective alternative for early dyslexia detection. By analyzing general eye movement patterns, including prolonged fixation durations and erratic saccades, we proposed an enhanced solution for determining eye-tracking-based dyslexia features. A Random Forest Classifier was then employed to detect dyslexia, achieving an accuracy of 88.58\%. Additionally, hierarchical clustering methods were applied to identify varying severity levels of dyslexia. The analysis incorporates diverse methodologies across various populations and settings, demonstrating the potential of this technology to identify individuals with dyslexia, including those with borderline traits, through non-invasive means. Integrating eye-tracking with machine learning represents a significant advancement in the diagnostic process, offering a highly accurate and accessible method in clinical research.


LLMs as Educational Analysts: Transforming Multimodal Data Traces into Actionable Reading Assessment Reports

Davalos, Eduardo, Zhang, Yike, Srivastava, Namrata, Salas, Jorge Alberto, McFadden, Sara, Cho, Sun-Joo, Biswas, Gautam, Goodwin, Amanda

arXiv.org Artificial Intelligence

Reading assessments are essential for enhancing students' comprehension, yet many EdTech applications focus mainly on outcome-based metrics, providing limited insights into student behavior and cognition. This study investigates the use of multimodal data sources -- including eye-tracking data, learning outcomes, assessment content, and teaching standards -- to derive meaningful reading insights. We employ unsupervised learning techniques to identify distinct reading behavior patterns, and then a large language model (LLM) synthesizes the derived information into actionable reports for educators, streamlining the interpretation process. LLM experts and human educators evaluate these reports for clarity, accuracy, relevance, and pedagogical usefulness. Our findings indicate that LLMs can effectively function as educational analysts, turning diverse data into teacher-friendly insights that are well-received by educators. While promising for automating insight generation, human oversight remains crucial to ensure reliability and fairness. This research advances human-centered AI in education, connecting data-driven analytics with practical classroom applications.


Beyond Instructed Tasks: Recognizing In-the-Wild Reading Behaviors in the Classroom Using Eye Tracking

Davalos, Eduardo, Salas, Jorge Alberto, Zhang, Yike, Srivastava, Namrata, Thatigotla, Yashvitha, Gonzales, Abbey, McFadden, Sara, Cho, Sun-Joo, Biswas, Gautam, Goodwin, Amanda

arXiv.org Artificial Intelligence

Understanding reader behaviors such as skimming, deep reading, and scanning is essential for improving educational instruction. While prior eye-tracking studies have trained models to recognize reading behaviors, they often rely on instructed reading tasks, which can alter natural behaviors and limit the applicability of these findings to in-the-wild settings. Additionally, there is a lack of clear definitions for reading behavior archetypes in the literature. We conducted a classroom study to address these issues by collecting instructed and in-the-wild reading data. We developed a mixed-method framework, including a human-driven theoretical model, statistical analyses, and an AI classifier, to differentiate reading behaviors based on their velocity, density, and sequentiality. Our lightweight 2D CNN achieved an F1 score of 0.8 for behavior recognition, providing a robust approach for understanding in-the-wild reading. This work advances our ability to provide detailed behavioral insights to educators, supporting more targeted and effective assessment and instruction.


Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

Chobey, Aryaman, Smith, Oliver, Wang, Anzi, Prasad, Grusha

arXiv.org Artificial Intelligence

The use of neural language models to model human behavior has met with mixed success. While some work has found that the surprisal estimates from these models can be used to predict a wide range of human neural and behavioral responses, other work studying more complex syntactic phenomena has found that these surprisal estimates generate incorrect behavioral predictions. This paper explores the extent to which the misalignment between empirical and model-predicted behavior can be minimized by training models on more developmentally plausible data, such as in the BabyLM Challenge. We trained teacher language models on the BabyLM "strict-small" dataset and used sentence level surprisal estimates from these teacher models to create a curriculum. We found tentative evidence that our curriculum made it easier for models to acquire linguistic knowledge from the training data: on the subset of tasks in the BabyLM challenge suite evaluating models' grammatical knowledge of English, models first trained on the BabyLM data curriculum and then on a few randomly ordered training epochs performed slightly better than models trained on randomly ordered epochs alone. This improved linguistic knowledge acquisition did not result in better alignment with human reading behavior, however: models trained on the BabyLM dataset (with or without a curriculum) generated predictions that were as misaligned with human behavior as models trained on larger less curated datasets. This suggests that training on developmentally plausible datasets alone is likely insufficient to generate language models capable of accurately predicting human language processing.


Lower Perplexity is Not Always Human-Like

Kuribayashi, Tatsuki, Oseki, Yohei, Ito, Takumi, Yoshida, Ryo, Asahara, Masayuki, Inui, Kentaro

arXiv.org Artificial Intelligence

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization -- the lower perplexity a language model has, the more human-like the language model is -- in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.


Modeling Task Effects in Human Reading with Neural Network-based Attention

Hahn, Michael, Keller, Frank

arXiv.org Artificial Intelligence

Research on human reading has long documented that reading behavior shows task-specific effects, but it has been challenging to build general models predicting what reading behavior humans will show in a given task. We introduce NEAT, a computational model of the allocation of attention in human reading, based on the hypothesis that human reading optimizes a tradeoff between economy of attention and success at a task. Our model is implemented using contemporary neural network modeling techniques, and makes explicit and testable predictions about how the allocation of attention varies across different tasks. We test this in an eyetracking study comparing two versions of a reading comprehension task, finding that our model successfully accounts for reading behavior across the tasks. Our work thus provides evidence that task effects can be modeled as optimal adaptation to task demands.


Faithfully Explaining Rankings in a News Recommender System

ter Hoeve, Maartje, Schuth, Anne, Odijk, Daan, de Rijke, Maarten

arXiv.org Artificial Intelligence

There is an increasing demand for algorithms to explain their outcomes. So far, there is no method that explains the rankings produced by a ranking algorithm. To address this gap we propose LISTEN, a LISTwise ExplaiNer, to explain rankings produced by a ranking algorithm. To efficiently use LISTEN in production, we train a neural network to learn the underlying explanation space created by LISTEN; we call this model Q-LISTEN. We show that LISTEN produces faithful explanations and that Q-LISTEN is able to learn these explanations. Moreover, we show that LISTEN is safe to use in a real world environment: users of a news recommendation system do not behave significantly differently when they are exposed to explanations generated by LISTEN instead of manually generated explanations.